Mining Very Large Databases
Authors
Abstract
Established companies have had decades to accumulate masses of data about their customers, suppliers, and products and services. The rapid pace of e-commerce means that Web startups can become huge enterprises in months, not years, amassing proportionately large databases as they grow. Data mining, also known as knowledge discovery in databases [1], gives organizations the tools to sift through these vast data stores to find the trends, patterns, and correlations that can guide strategic decision making.
Traditionally, algorithms for data analysis assume that the input data contains relatively few records. Current databases, however, are much too large to be held in main memory. Retrieving data from disk is markedly slower than accessing data in RAM. Thus, to be efficient, the data-mining techniques applied to very large databases must be highly scalable. An algorithm is said to be scalable if, given a fixed amount of main memory, its runtime increases linearly with the number of records in the input database. Recent work has focused on scaling data-mining algorithms to very large data sets.
In this survey, we describe a broad range of algorithms that address three classical data-mining problems: market basket analysis, clustering, and classification.
A market basket is a collection of items purchased by a customer in an individual customer transaction, which is a well-defined business activity, for example a customer's visit to a grocery store or an online purchase from a virtual store such as Amazon.com. Retailers accumulate huge collections of transactions by recording business activity over time. One common analysis run against a transactions database is to find sets of items, or itemsets, that appear together in many transactions. Each pattern extracted through the analysis consists of an itemset and the number of transactions that contain it. Businesses can use knowledge of these patterns to improve the placement of items in a store or the layout of mail-order catalog pages and Web pages.
An itemset containing i items is called an i-itemset. The percentage of transactions that contain an itemset is called the itemset's support. For an itemset to be interesting, its support must be higher than a user-specified minimum; such itemsets are said to be frequent.
Figure 1 shows three transactions stored in a relational database system. The database has five fields: a transaction identifier, a customer identifier, the item purchased, its price, and the transaction date. The first transaction shows a customer who …
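The definitions of support and frequent itemsets given in the abstract can be illustrated with a small sketch. It is not one of the scalable algorithms the survey describes: it enumerates candidate itemsets by brute force over an in-memory list, which is only practical for tiny inputs. The transactions, the function names `support` and `frequent_itemsets`, and the `max_size` cap are assumptions made for illustration, and the data is not taken from Figure 1. Support is returned as a fraction between 0 and 1 rather than as a percentage.

```python
from itertools import combinations

# Hypothetical transactions (not the rows of Figure 1):
# each set holds the items bought together in one transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "jam"},
    {"bread", "milk", "jam"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def frequent_itemsets(transactions, min_support, max_size=2):
    """Brute-force search for i-itemsets (i <= max_size) whose support
    is strictly higher than `min_support`, as in the definition above."""
    items = sorted(set().union(*transactions))
    result = {}
    for i in range(1, max_size + 1):
        for candidate in combinations(items, i):
            s = support(candidate, transactions)
            if s > min_support:
                result[candidate] = s
    return result


if __name__ == "__main__":
    for itemset, s in frequent_itemsets(transactions, min_support=0.5).items():
        print(itemset, s)
```

With a minimum support of 0.5, only `bread`, `milk`, and the 2-itemset `(bread, milk)` qualify here, each appearing in three of the four transactions; `butter` and `jam` sit exactly at 0.5 and are excluded because the threshold is strict.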
Similar resources
Mining Very Large Databases Using Software Agents
Some databases are simply large (e.g., with terabytes of data). A number of potential and diverse patterns in databases would sufficiently be mined so as to support various decisions in applications. This research advocates a re-recognition and re-cogitation to how to discover very large databases. It also presents a new technique mining very large databases will be developed on different hier...
Mining Spatial Data Using An Interactive Rule-Based Approach
With the advent of very large spatial databases, it is beyond human capacity to examine and understand the information contained within such volumes of data directly. Although data mining has been recognized as a key means of finding patterns in large databases, general data mining methods alone are not sufficient for spatial data mining. The strengths of the computer’s ability to perform numer...
H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Databases
Methods for efficient mining of frequent patterns have been studied extensively by many researchers. However, the previously proposed methods still encounter some performance bottlenecks when mining databases with different data characteristics, such as dense vs. sparse, long vs. short patterns, memory-based vs. disk-based, etc. In this study, we propose a simple and novel hyperlinked data stru...
New Parallel Algorithms for Frequent Itemset Mining in Very Large Databases
Frequent itemset mining is a classic problem in data mining. It is a non-supervised process which concerns in finding frequent patterns (or itemsets) hidden in large volumes of data in order to produce compact summaries or models of the database. These models are typically used to generate association rules, but recently they have also been used in far reaching domains like e-commerce and bio-i...
A Probabilistic Bayesian Classifier Approach for Breast Cancer Diagnosis and Prognosis
Basically, medical diagnosis problems are the most effective component of treatment policies. Recently, significant advances have been formed in medical diagnosis fields using data mining techniques. Data mining or Knowledge Discovery is searching large databases to discover patterns and evaluate the probability of next occurrences. In this paper, Bayesian Classifier is used as a Non-linear dat...
An incremental mining algorithm for maintaining sequential patterns using pre-large sequences
Mining useful information and helpful knowledge from large databases has evolved into an important research area in recent years. Among the classes of knowledge derived, finding sequential patterns in temporal transaction databases is very important since it can help model customer behavior. In the past, researchers usually assumed databases were static to simplify data-mining problems. In real...
Journal title:
- IEEE Computer
Volume 32, Issue
Pages -
Publication date 1999